In [1]:
%matplotlib inline
import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import math
import os

Hands-on: Linear Regression - Complex Shapes

Test 1
Input Features: x
Output / Target: y_noisy
Objective: Underfitting demo

Test 2
Input Features: x, x^2
Output / Target: y_noisy
Objective: How adding relevant features improves prediction accuracy
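
As a rough local cross-check of the two tests (the actual models in this notebook are trained and evaluated with AWS Machine Learning), the same comparison can be sketched with numpy.polyfit. The helper name local_rmse_check and the in-order 200/60 split are illustrative assumptions, and the sketch relies on the df constructed in the cells below.

# Hedged sketch, not the AWS ML pipeline: fit y_noisy on x alone (degree 1, Test 1)
# vs. on x and x^2 (degree 2, Test 2) and compare RMSE on the held-out rows.
def local_rmse_check(df):
    train, test = df.iloc[:200], df.iloc[200:]
    c1 = np.polyfit(train.x, train.y_noisy, deg = 1)   # straight line -> underfits
    c2 = np.polyfit(train.x, train.y_noisy, deg = 2)   # quadratic -> matches the true shape
    rmse = lambda c: np.sqrt(np.mean((np.polyval(c, test.x) - test.y_noisy) ** 2))
    return rmse(c1), rmse(c2)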


In [2]:
def quad_func(x):
    return 5 * x ** 2 - 23 * x + 47

In [3]:
# Training Set + Eval Set: 200 samples (split 70% / 30% by AWS ML)
# Test Set: 60 samples
# Total: 260 samples
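
The 70% / 30% train/evaluation split is performed server-side by AWS ML when the datasource is created, so the exact rows it selects are not reproducible here. A minimal local approximation (an in-order split of the df built in the next cells, purely for illustration):

# Hedged sketch: mirrors the row counts only; AWS ML shuffles and splits internally.
train_eval = df.iloc[:200]                 # rows written to the training CSVs below
split_at = int(len(train_eval) * 0.7)      # 140 training rows
local_train = train_eval.iloc[:split_at]
local_eval = train_eval.iloc[split_at:]    # 60 evaluation rows
# Rows 200-259 never reach training; the batch-prediction files below contain all 260 rows.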

In [4]:
np.random.seed(5)
samples = 260
x_vals = pd.Series(np.random.rand(samples) * 20)
x2_vals = x_vals ** 2
y_vals = x_vals.map(quad_func)
y_noisy_vals = y_vals + np.random.randn(samples) * 50

In [5]:
df = pd.DataFrame({'x': x_vals, 
                   'x2': x2_vals ,
                   'y': y_vals, 
                   'y_noisy': y_noisy_vals})

In [6]:
df.head()


Out[6]:
           x          x2            y      y_noisy
0   4.439863   19.712387    43.445077    88.950606
1  17.414646  303.269900  1162.812637  1193.704875
2   4.134383   17.093124    37.374807    62.355709
3  18.372218  337.538400  1312.130983  1254.553770
4   9.768224   95.418196   299.421832   268.896012

In [7]:
df.corr()


Out[7]:
                x        x2         y   y_noisy
x        1.000000  0.968304  0.948299  0.940940
x2       0.968304  1.000000  0.997515  0.991770
y        0.948299  0.997515  1.000000  0.994777
y_noisy  0.940940  0.991770  0.994777  1.000000

In [8]:
fig = plt.figure(figsize = (12, 8))
plt.scatter(x = df['x'],
            y = df['y'],
            color = 'r',
            label = 'y')
plt.scatter(x = df['x'],
            y = df['y_noisy'],
            color = 'b',
            label = 'y noisy', 
            marker = '+')
plt.xlabel('x')
plt.ylabel('Target Attribute')
plt.grid(True)
plt.legend()


Out[8]:
<matplotlib.legend.Legend at 0x256c8271ac8>

In [9]:
# Raw string so the Windows-style backslashes are not interpreted as escape sequences
data_path = r'..\Data\RegressionExamples\quadratic'

In [10]:
df.to_csv(os.path.join(data_path,'quadratic_example_all.csv'),
          index = True,
          index_label = 'Row')

Training and Evaluation Set

Training Set 1: Row, x, y_noisy
Training Set 2: Row, x, x2, y_noisy


In [11]:
df[df.index < 200].to_csv(os.path.join(data_path, 'quadratic_example_train_underfit.csv'),
                          index = True,
                          index_label = 'Row', 
                          columns = ['x', 'y_noisy'])

In [12]:
df[df.index < 200].to_csv(os.path.join(data_path, 'quadratic_example_train_normal.csv'),
                          index = True,
                          index_label = 'Row',
                          columns= ['x', 'x2', 'y_noisy'])

In [13]:
df.to_csv(os.path.join(data_path, 'quadratic_example_test_all_underfit.csv'), 
          index = True,
          index_label = 'Row', 
          columns = ['x'])

In [14]:
df.to_csv(os.path.join(data_path, 'quadratic_example_test_all_normal.csv'),
          index = True,
          index_label = 'Row', 
          columns = ['x', 'x2'])

In [15]:
# Pull Predictions
# Prediction without quadratic term
df = pd.read_csv(os.path.join(data_path,'quadratic_example_all.csv'), 
                 index_col = 'Row')
df_predicted_underfit = pd.read_csv(os.path.join(data_path, 'output_underfit',
                                                 'bp-pNYIAR35aSV-quadratic_example_test_all_underfit.csv.gz'))
df_predicted_underfit.columns = ["Row", "y_predicted"]

In [16]:
fig = plt.figure(figsize = (12, 8))
plt.scatter(x = df.x,
            y = df.y_noisy,
            color = 'b',
            label = 'actual', 
            marker = '+')
plt.scatter(x = df.x,
            y = df_predicted_underfit.y_predicted ,
            color = 'g',
            label = 'Fit (x)',
            marker = '^')
plt.title('Quadratic - underfit')
plt.xlabel('x')
plt.ylabel('Target Attribute')
plt.grid(True)
plt.legend()


Out[16]:
<matplotlib.legend.Legend at 0x256c8459e10>

Test 1: Training RMSE: 385.18, Evaluation RMSE: 257.89, Baseline RMSE: 437.31
Wojciech's results: Training RMSE: 385.16, Evaluation RMSE: 257.898, Baseline RMSE: 437.311

The model's RMSE is large and close to the baseline RMSE, which indicates underfitting; a local check is sketched below.
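
As a hedged local check (the figures above come from the AWS ML evaluation on its internal 30% split, so an all-rows number will not match them exactly), RMSE can be computed directly against the batch-prediction output:

# Hedged sketch: RMSE of the underfit predictions and of a mean-only baseline,
# computed over all 260 rows as a sanity check rather than a reproduction.
resid_underfit = df.y_noisy.values - df_predicted_underfit.y_predicted.values
print('All-rows RMSE (underfit model):', np.sqrt(np.mean(resid_underfit ** 2)))
print('All-rows baseline RMSE (predict the mean):',
      np.sqrt(np.mean((df.y_noisy - df.y_noisy.mean()) ** 2)))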


In [17]:
fig = plt.figure(figsize = (12, 8))
plt.boxplot([df.y_noisy, df_predicted_underfit.y_predicted], 
            labels = ['actual','predicted-underfit'])
plt.title('Box Plot - Actual, Predicted')
plt.ylabel('y')
plt.grid(True)



In [18]:
df.y_noisy.describe()


Out[18]:
count     260.000000
mean      492.434283
std       478.849813
min      -112.575294
25%        77.826912
50%       327.241317
75%       874.702202
max      1664.910364
Name: y_noisy, dtype: float64

In [19]:
df_predicted_underfit.y_predicted.describe()


Out[19]:
count     260.000000
mean      662.497185
std       409.042715
min       -40.808170
25%       301.881100
50%       675.825400
75%      1036.082500
max      1354.002000
Name: y_predicted, dtype: float64

In [20]:
df_predicted_normal = pd.read_csv(os.path.join(data_path,'output_normal',
                                               'bp-In6EUvWaCw2-quadratic_example_test_all_normal.csv.gz'))
df_predicted_normal.columns = ["Row", "y_predicted"]

In [21]:
fig = plt.figure(figsize = (12, 8))
plt.scatter(x = df.x,
            y = df.y_noisy,
            color = 'b',
            label = 'actual', 
            marker ='+')
plt.scatter(x = df.x,
            y = df_predicted_underfit.y_predicted,
            color = 'g',
            label = 'Fit (x)',
            marker = '^')
plt.scatter(x = df.x ,
            y = df_predicted_normal.y_predicted ,
            color = 'r',
            label = 'Fit (x,x^2)')
plt.title('Quadratic - normal fit')
plt.grid(True)
plt.xlabel('x')
plt.ylabel('Target Attribute')
#plt.legend()


Out[21]:
<matplotlib.text.Text at 0x256c859e1d0>

Test 1: Training RMSE: 385.16, Evaluation RMSE: 257.89, Baseline RMSE: 437.31

Test 2: Training RMSE: 132.20, Evaluation RMSE: 63.68, Baseline RMSE: 437.31

Test 2's RMSE is much lower than the baseline. Note that the target includes Gaussian noise with a standard deviation of 50 (np.random.randn(samples) * 50), so an evaluation RMSE in the neighborhood of 50 is close to the best achievable.
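
The baseline RMSE is what a naive model that always predicts the mean of the target would score, which is why beating it by a wide margin matters. A minimal local illustration (computed over all 260 rows here, so it will not equal the 437.31 reported on the evaluation split):

# Hedged sketch: mean-only baseline vs. the (x, x^2) model, over all rows.
baseline_pred = np.full(len(df), df.y_noisy.mean())
print('Mean-predictor RMSE (all rows):',
      np.sqrt(np.mean((df.y_noisy - baseline_pred) ** 2)))
print('All-rows RMSE (normal model):',
      np.sqrt(np.mean((df.y_noisy.values - df_predicted_normal.y_predicted.values) ** 2)))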


In [22]:
fig = plt.figure(figsize = (12, 8))
plt.boxplot([df.y_noisy,df_predicted_underfit.y_predicted, df_predicted_normal.y_predicted], 
            labels = ['actual','predicted-underfit','predicted-normal'])
plt.title('Box Plot - Actual, Predicted')
plt.ylabel('y')
plt.grid(True)



In [23]:
df_predicted_underfit.head()


Out[23]:
Row y_predicted
0 0 269.2752
1 1 1177.0090
2 2 247.9033
3 3 1244.0020
4 4 642.0548

In [24]:
df_predicted_normal.head()


Out[24]:
Row y_predicted
0 0 53.94965
1 1 1201.89800
2 2 44.75586
3 3 1345.26700
4 4 346.27510

Summary

  1. Underfitting occurs when the model does not accurately capture the relationship between the features and the target.
  2. Underfitting causes large training and evaluation errors.
    Training RMSE: 385.1816, Evaluation RMSE: 257.8979, Baseline RMSE: 437.311
  3. Evaluation Summary - the prediction over-estimation and under-estimation histogram in the AWS ML console provides important clues about how the model is behaving. Ideally, under-estimation and over-estimation should be balanced and centered around 0 (a local version of this view is sketched after this list).
  4. The box plots also highlight distribution differences between the predicted and actual values.
  5. To address underfitting, add higher-order polynomial terms or other relevant features that capture the complex relationship.
    Training RMSE: 132.2032, Evaluation RMSE: 63.6847, Baseline RMSE: 437.311
  6. When working with datasets containing hundreds or even thousands of features, it is important to rely on these metrics and distributions to gain insight into model performance.
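
As a local stand-in for the console's over/under-estimation histogram mentioned in point 3 (using the DataFrames already loaded above), one option is:

# Hedged sketch: residual (actual - predicted) histograms for both models.
# Balanced, zero-centered residuals suggest no systematic over- or under-estimation.
fig = plt.figure(figsize = (12, 8))
plt.hist(df.y_noisy.values - df_predicted_underfit.y_predicted.values,
         bins = 30, alpha = 0.5, label = 'underfit (x)')
plt.hist(df.y_noisy.values - df_predicted_normal.y_predicted.values,
         bins = 30, alpha = 0.5, label = 'normal (x, x^2)')
plt.axvline(0, color = 'k')
plt.title('Residual Histogram - Underfit vs Normal')
plt.xlabel('residual (actual - predicted)')
plt.ylabel('count')
plt.grid(True)
plt.legend()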